By Selena Shew, Cassandra Zhang, Jamie Jiang
Group CSJ
December 2, 2023
#first I will read in the data
import pandas as pd
airbnb = pd.read_csv("airbnb.csv", parse_dates=['host_since'])
airbnb.head()
|  | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13188.0 | https://www.airbnb.com/rooms/13188 | 2.023090e+13 | 2023-09-06 | city scrape | Rental unit in Vancouver · ★4.83 · Studio · 2 ... | Garden level studio suite with garden patio - ... | The uber hip Main street area is a short walk ... | https://a0.muscache.com/pictures/8408188/e1af6... | 51466 | ... | 4.92 | 4.88 | 4.80 | 23-156488 | f | 2 | 2 | 0 | 0 | 1.68 |
| 1 | 13358.0 | https://www.airbnb.com/rooms/13358 | 2.023090e+13 | 2023-09-06 | city scrape | Condo in Vancouver · ★4.68 · 1 bedroom · 1 bed... | <b>The space</b><br />This suites central loca... | NaN | https://a0.muscache.com/pictures/40034c18-0837... | 52116 | ... | 4.79 | 4.92 | 4.65 | 22-311727 | f | 1 | 1 | 0 | 0 | 2.96 |
| 2 | 13490.0 | https://www.airbnb.com/rooms/13490 | 2.023090e+13 | 2023-09-06 | city scrape | Rental unit in Vancouver · ★4.92 · 1 bedroom ·... | This apartment rents for one month blocks of t... | In the heart of Vancouver, this apartment has ... | https://a0.muscache.com/pictures/73394727/79d5... | 52467 | ... | 4.97 | 4.79 | 4.89 | NaN | f | 1 | 1 | 0 | 0 | 0.66 |
| 3 | 14267.0 | https://www.airbnb.com/rooms/14267 | 2.023090e+13 | 2023-09-06 | city scrape | Home in Vancouver · ★4.76 · 1 bedroom · 2 beds... | The Ecoloft is located in the lovely, family r... | We live in the centre of the city of Vancouver... | https://a0.muscache.com/pictures/3646de9b-934e... | 56030 | ... | 4.68 | 4.77 | 4.71 | 21-156500 | t | 1 | 1 | 0 | 0 | 0.22 |
| 4 | 14424.0 | https://www.airbnb.com/rooms/14424 | 2.023090e+13 | 2023-09-06 | city scrape | Guest suite in Vancouver · ★4.69 · 1 bedroom ·... | <b>The space</b><br />Welcome to Strathcona --... | NaN | https://a0.muscache.com/pictures/miso/Hosting-... | 56709 | ... | 4.72 | 4.60 | 4.73 | 19-162091 | f | 4 | 4 | 0 | 0 | 1.63 |
5 rows × 74 columns
#Then I will clean up the data
#I need to drop all of the columns we won't be using
#I need to filter and rename the property types for simplicity
#I will also rename the room types and superhost designation for ease of understanding
#I will need to remove our weird outliers (we have two data points with a daily rate bigger than $3,000 while everything else is cheaper)
#I also need to drop any rows with missing (NA) values
#keep only the columns we want to examine
airbnb_cleaned = airbnb[['accommodates','price', 'bathrooms', 'beds', 'number_of_reviews', 'neighbourhood_cleansed',
'property_type', 'host_is_superhost', 'review_scores_rating', 'room_type',
'host_response_time', 'host_since', 'latitude', 'longitude']]
#filter to keep only the main property types and rename
keep_prop_types = ['Entire condo', 'Entire rental unit', 'Entire guest suite', 'Entire home', 'Entire townhouse',
'Private room in condo', 'Private room in home', 'Private room in rental unit',
'Private room in guest suite', 'Private room in townhouse']
airbnb_cleaned = airbnb_cleaned.query(
'property_type == @keep_prop_types'
).replace(
{'Entire condo': 'Condo',
'Private room in condo': 'Condo',
'Entire rental unit': 'Rental Suite',
'Private room in rental unit': 'Rental Suite',
'Entire guest suite': 'Guest Suite',
'Private room in guest suite': 'Guest Suite',
'Entire home': 'House',
'Private room in home': 'House',
'Entire townhouse': 'Townhouse',
'Private room in townhouse': 'Townhouse'
}
)
#clean the room types & superhost designation
airbnb_cleaned = airbnb_cleaned.replace(
{"Entire home/apt": 'Entire place',
't': 'True',
'f': 'False'}
)
#remove the thousands separator and the dollar sign from the price column (regex=False treats both as literal characters)
airbnb_cleaned['price'] = airbnb_cleaned['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)
#filter out the outliers: listings where the price is greater than $3,000
airbnb_cleaned = airbnb_cleaned.query('price <= 3000.0')
#finally, I'll drop all rows with missing values
airbnb_cleaned = airbnb_cleaned.dropna()
airbnb_cleaned.head()
|  | accommodates | price | bathrooms | beds | number_of_reviews | neighbourhood_cleansed | property_type | host_is_superhost | review_scores_rating | room_type | host_response_time | host_since | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 151.0 | 1 | 2.0 | 277 | Riley Park | Rental Suite | True | 4.83 | Entire place | within an hour | 2009-11-04 | 49.24773 | -123.10509 |
| 1 | 2 | 215.0 | 1 | 1.0 | 476 | West End | Condo | False | 4.68 | Entire place | within an hour | 2009-11-07 | 49.28201 | -123.12669 |
| 2 | 2 | 150.0 | 1 | 1.0 | 99 | Kensington-Cedar Cottage | Rental Suite | True | 4.92 | Entire place | within an hour | 2009-11-08 | 49.25622 | -123.06607 |
| 4 | 2 | 135.0 | 1 | 1.0 | 269 | Downtown Eastside | Guest Suite | False | 4.69 | Entire place | within a few hours | 2009-11-23 | 49.27921 | -123.08835 |
| 6 | 6 | 100.0 | 1 | 4.0 | 3 | Grandview-Woodland | House | False | 4.00 | Entire place | a few days or more | 2009-11-29 | 49.26339 | -123.07145 |
airbnb_cleaned.shape
(4286, 14)
airbnb_cleaned.describe()
|  | accommodates | price | beds | number_of_reviews | review_scores_rating | latitude | longitude |
|---|---|---|---|---|---|---|---|
| count | 4286.000000 | 4286.000000 | 4286.000000 | 4286.000000 | 4286.000000 | 4286.000000 | 4286.000000 |
| mean | 3.588661 | 225.643724 | 1.949837 | 51.077462 | 4.769606 | 49.262193 | -123.110515 |
| std | 2.030560 | 166.831182 | 1.151349 | 78.005589 | 0.389521 | 0.020632 | 0.038710 |
| min | 1.000000 | 30.000000 | 1.000000 | 1.000000 | 0.000000 | 49.202960 | -123.214820 |
| 25% | 2.000000 | 124.000000 | 1.000000 | 5.000000 | 4.720000 | 49.249278 | -123.130748 |
| 50% | 3.000000 | 181.000000 | 2.000000 | 20.000000 | 4.880000 | 49.267743 | -123.111145 |
| 75% | 4.000000 | 271.000000 | 2.000000 | 64.750000 | 5.000000 | 49.278884 | -123.086525 |
| max | 16.000000 | 2257.000000 | 13.000000 | 916.000000 | 5.000000 | 49.294360 | -123.023903 |
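The price-cleaning and outlier steps above can be checked on a small made-up sample. This is only a minimal sketch, assuming prices are stored as strings such as `$1,250.00`; the `toy` frame and its values are invented for illustration and are not taken from the Airbnb data.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
toy = pd.DataFrame({"price": ["$151.00", "$1,250.00", "$8,500.00", None]})

# Strip the thousands separator and the dollar sign as literal characters,
# then convert the remaining text to a float.
toy["price"] = (
    toy["price"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .astype(float)
)

# Keep listings at or below $3,000 per day, then drop rows with missing values.
toy_cleaned = toy.query("price <= 3000.0").dropna()
print(toy_cleaned["price"].tolist())
```

Passing `regex=False` makes the intent explicit, since `$` is otherwise a special character in regular expressions.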
#Finally I will export the cleaned dataframe to csv so that my group members can use it
airbnb_cleaned.to_csv('cleaned_airbnb_data_final.csv', index=False)
import altair as alt
import geojson
# Handle large data sets without embedding them in the notebook
#alt.data_transformers.enable('data_server')
import vegafusion as vf
vf.enable_widget()
# Default Rendering
alt.renderers.enable('default')
RendererRegistry.enable('default')
slider2 = alt.binding_range(min=0.1, max = 1.0, step=0.1, name='Opacity:')
op_opacity = alt.param(value = 0.7, bind=slider2)
brush_select = alt.selection_interval(encodings = ['x'], empty = False)
task_1 = alt.Chart(airbnb_cleaned, width = 400, height = 250, title = "Task 1: Daily Price Vs. Number of Beds, Bathrooms, People Accommodated").mark_circle(opacity=op_opacity, stroke='black', strokeWidth=0.5).encode(
x = alt.X("beds:Q", axis=alt.Axis(grid=False, ticks=False)).title("Number of Beds"),
y = alt.Y("price:Q", axis=alt.Axis(grid=False, ticks=False)).title("Price Per Day"),
size = alt.Size("bathrooms:Q").title("Number of Bathrooms"),
color = alt.condition(brush_select, 'accommodates:Q', alt.value('lightgray')),
tooltip = alt.Tooltip(['price:Q', 'beds:Q', 'bathrooms:Q'])
).add_params(
op_opacity,
brush_select
)
task_1
The plot above shows how the daily price changes with the number of beds (x-axis) and the number of bathrooms (point size). Dragging a selection interval along the x-axis colours the selected points by the number of people accommodated, and the opacity slider makes overlapping points easier to tell apart.
accomm_chart = alt.Chart(airbnb_cleaned, width = 400, height = 250, title = "Daily Price Vs. Number of People Accommodated").mark_circle(size=150, stroke='black', strokeWidth=1).encode(
x = alt.X("accommodates:Q", axis=alt.Axis(grid=False, ticks=False), scale=alt.Scale(zero=False)).title("Number of People Accommodated"),
y = alt.Y("price:Q", axis=alt.Axis(grid=False, ticks=False), scale=alt.Scale(zero=False)).title("Price Per Day"),
tooltip = alt.Tooltip(['price:Q', 'accommodates:Q'])
)
accomm_chart
The plot above shows how the price varies with the number of people that can be accommodated.
task_1_vis = accomm_chart.encode(color = alt.condition(brush_select, alt.value('#4ba670'), alt.value('lightgray'))).add_params(brush_select) | task_1
task_1_vis
Now we have put both plots together to address the task. They are linked bidirectionally through a shared selection interval: brushing either plot highlights the same listings in the other, so the number of people accommodated can be read alongside the number of beds and bathrooms for each listing.
brush = alt.selection_interval(encodings = ['y'], empty = True)
selection = alt.selection_point(fields=['room_type'])
color = alt.condition(
selection,
alt.Color('room_type:N').legend(None),
alt.value('lightgray')
)
tick_plot = alt.Chart(airbnb_cleaned, title = "Task 2: Exploring How Review Ratings Vary With Different Property and Room Types").mark_tick(size = 20).encode(
x = alt.X('property_type:N', title="Property Type",axis=alt.Axis(labelAngle=-45)),
y = alt.Y('review_scores_rating', title= "Review Scores"),
color = color
).properties(
width = 400,
height = 250
).add_params(brush)
legend = alt.Chart(airbnb_cleaned).mark_point().encode(
alt.Y('room_type:N', axis=alt.Axis(orient='right')),
color=color
).add_params(
selection
)
bars = alt.Chart(airbnb_cleaned, title = "Count of Combination of Property & Room Types").mark_bar().encode(
x= alt.X('count():Q', title = "Counts of the combination"),
y= alt.Y('property_type:N', title = "Property Type"),
color='room_type:N',
tooltip=['property_type:N', 'count():Q', 'room_type:N']
).properties(
width=400,
height=250
).transform_filter(
brush
)
task_2_vis = tick_plot|legend|bars
task_2_vis
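Task 2 is addressed by the tick plot on the left, which can be filtered by room type by clicking an entry in the legend in the middle. The bar chart on the right shows the number of listings for each combination of property and room type, and brushing a range of review scores on the tick plot restricts the bar chart to the listings in that range.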
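In the code above, the bar chart is filtered by the interval brush on the tick plot and then counts listings per property and room type. The numbers it displays can be cross-checked in pandas. The sketch below uses an invented `toy` frame and a hypothetical rating interval (`low`, `high`) standing in for the brush.

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "property_type": ["Condo", "Condo", "House", "House", "House"],
        "room_type": [
            "Entire place",
            "Private room",
            "Entire place",
            "Entire place",
            "Private room",
        ],
        "review_scores_rating": [4.9, 4.2, 4.8, 3.9, 4.7],
    }
)

# Stand-in for the brush: keep only ratings inside the chosen interval.
low, high = 4.5, 5.0
in_brush = toy[toy["review_scores_rating"].between(low, high)]

# Count the listings for each combination of property and room type.
counts = in_brush.groupby(["property_type", "room_type"]).size()
print(counts)
```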
selection = alt.selection_point(fields=['host_is_superhost', 'host_response_time'])
color = alt.condition(
selection,
alt.Color('host_is_superhost:N').legend(None),
alt.value('lightgray')
)
scatter = alt.Chart(airbnb_cleaned, title = "Task 3: How Review Ratings Vary With Number of Reviews, Host Response Time, & Superhost Status").mark_point(size=88).encode(
x='number_of_reviews:Q',
y='review_scores_rating:Q',
color=color,
tooltip = ['number_of_reviews:Q', 'review_scores_rating:Q', 'host_response_time:N', 'host_is_superhost:N']
).properties(
width = 400,
height = 250
)
legend = alt.Chart(airbnb_cleaned, title= "Filter Combo of Host Response Time & Superhost Status").mark_rect().encode(
alt.Y('host_is_superhost').axis(orient='right'),
x=alt.X('host_response_time',axis=alt.Axis(labelAngle=-45)),
tooltip = ['host_response_time:N', alt.Tooltip('count():Q')],
color=color
).add_params(
selection
).properties(
width = 400,
height = 100
)
task_3_vis = scatter | legend
task_3_vis
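Here, clicking a cell in the filter on the right highlights the matching listings in the scatter plot on the left, which shows the number of reviews and the review rating for each Airbnb. Each cell of the filter is one combination of host response time and whether or not the host is a designated superhost.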
First we needed to get the geographical map of Vancouver. We found the relevant geojson file here: https://github.com/blackmad/neighborhoods/blob/master/vancouver.geojson
# the code is adapted from https://stackoverflow.com/questions/74168389/can-mark-geoshape-be-used-for-canadian-provinces-cities
can_prov_file = 'vancouver.geojson'
with open(can_prov_file) as f:
    var_geojson = geojson.load(f)
data_geojson = alt.InlineData(values=var_geojson, format=alt.DataFormat(property='features',type='json'))
# chart object
vancouver = alt.Chart(data_geojson).mark_geoshape(fill='lightgray',
stroke='white'
).project(
type='identity', reflectY=True
).properties(height=300, width = 800)
points = alt.Chart(airbnb_cleaned).mark_circle(size=30,opacity=0.8).encode(
latitude='latitude:Q',
longitude='longitude:Q',
color=alt.Color('review_scores_rating', scale = alt.Scale(scheme='plasma',domain=[5,4]),
legend = alt.Legend(orient = 'bottom')).title('Review Rating'),
tooltip=[alt.Tooltip('review_scores_rating', title='Review Rating'), alt.Tooltip('neighbourhood_cleansed', title='Neighbourhood')]
)
vancouver_map = vancouver + points
#vancouver_map
genres = ['Entire place', 'Private room']
room_type_dropdown = alt.binding_select(options=genres, name="Room Type")
room_type_select = alt.selection_point(fields=['room_type'], bind=room_type_dropdown)
filter_genres = points.add_params(
room_type_select
).transform_filter(
room_type_select
).properties(title="Task 4: How Airbnb Ratings Vary With Neighbourhood & Room Type")
map_review_rating = vancouver + filter_genres
map_review_rating
The map above shows the review rating of each Airbnb in Vancouver. A dropdown filter restricts the map to the listings of a single room type.
heat_map = alt.Chart(airbnb_cleaned).mark_rect().encode(
color = alt.Color('mean(review_scores_rating)',scale = alt.Scale(scheme='plasma',domain = [5,4]), legend = None),
x = alt.X('neighbourhood_cleansed', axis=alt.Axis(labelAngle=-45)).title('Neighbourhood'),
y = alt.Y('room_type').title('Room Type'),
tooltip=alt.Tooltip(['mean(review_scores_rating)'], format='.2f')
).properties(height = 60, width = 800)
heat_map
The heat map shown above displays the average review rating for each combination of room type and neighbourhood.
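The averages behind the heat map can also be computed directly in pandas, which is a convenient way to sanity-check the colours. This is a sketch on an invented `toy` frame, not the real data; a combination with no listings shows up as `NaN`.

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "neighbourhood_cleansed": ["West End", "West End", "Riley Park", "Riley Park"],
        "room_type": ["Entire place", "Entire place", "Entire place", "Private room"],
        "review_scores_rating": [4.6, 4.8, 5.0, 4.5],
    }
)

# Mean review rating for each room type (rows) and neighbourhood (columns).
mean_rating = toy.pivot_table(
    index="room_type",
    columns="neighbourhood_cleansed",
    values="review_scores_rating",
    aggfunc="mean",
)
print(mean_rating.round(2))
```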
task_4_vis = alt.vconcat(map_review_rating, heat_map)
task_4_vis
The final image above combines both visualizations to answer Task 4.
Here we put all of our visualizations together:
# dashboard = alt.vconcat(task_1_vis, task_3_vis, task_4_vis, task_2_vis)
# dashboard
display(task_1_vis)
display(task_2_vis)
display(task_3_vis)
display(map_review_rating)
display(heat_map)
*Please note that we did not combine all of the tasks into a single dashboard with alt.vconcat() or alt.hconcat(), as that moved our sliders, dropdowns, and filters to the bottom of the combined chart.